Overview

This journal contains plots created from the training and testing datasets.

# Load libraries
library(tidyverse)
library(cowplot)
library(plotly)

The Data

The cleaning training and testing datasets are loaded in below.

# Load in the training data (Hamby Data 173 and 252) and testing data (Hamby Data 224 Sets 1 and 11)
hamby173and252_train <- read.csv("../data/hamby173and252_train.csv")
hamby224_test <- read.csv("../data/hamby224_test.csv")

A vector containing the features used in the random forest rtrees is created below that will be used when creating some of the plots in this journal.

# Obtain features used when fitting the rtrees random forest
rf_features <- rownames(bulletr::rtrees$importance)

Plots of Features

Distributions by samesource

I created the following two plots of the training data as suggested by Heike. The histograms below show the distributions of the features used in the random forest rtrees. The histograms are filled by the samesource variable, which is the truth of whether or not the comparison is from the same barrel and land. The default histograms make it hard to compare the distributions of the matches and non-matches since there are many more comparisons that have samesource == FALSE.

# Create plots of the feature distributions colored by samesource for the training data
hamby173and252_train %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(bins = 30) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Frequency", fill = "Same Source?",
       title = "Hamby 173 and 252 Training Data") +
  theme_bw() +
  theme(legend.position = "bottom")

By setting position = "fill" in the geom_histogram function, it is easier to compare the matches and non-matches. These plots could be used in the future to hand select the bins for lime. Additionally, fitting a logistic regression to this data could also be used to determine the LC50, LC10, ad LC90, which could be used as the bins for lime.

# Create plots of the feature distributions colored by samesource for the training data
# using the position = "fill" option
hamby173and252_train %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(position = "fill", bins = 30) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Proportion", fill = "Same Source?",
       title = "Hamby 173 and 252 Training Data") +
  theme_bw() +
  theme(legend.position = "bottom")

The plots below have the same structure, but they are created with the Hamby 224 testing data. I chose to not separate the testing data by sets, but this is something that could be done later if necessary.

# Create plots of the feature distributions colored by samesource for the testing data
hamby224_test %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(bins = 15) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Frequency", fill = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

# Create plots of the feature distributions colored by samesource for the testing 
# data using the position = "fill" option
hamby224_test %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(position = "fill", bins = 15) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Proportion", fill = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

Correlations

I made these plots to look at the correlation between features in the training data within the TRUE and FALSE cases of samesource. The features are highly correlated for the match comparisons. It is clear that the variables are more correlated with the match comparisons than the non-match comparisons. However, there are still some variables that are relatively highly correlation with the non-match comparisons.

# Create a correlation heatmap of the match comparisons in the training data
cor_match <- hamby173and252_train %>%
  select(rf_features, samesource) %>%
  filter(samesource == TRUE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + 
  labs(x = "", y = "", fill = "Correlation",
       title = "Match Comparisons")

# Create a correlation heatmap of the non-match comparisons in the training data
cor_nonmatch <- hamby173and252_train %>%
  select(rf_features, samesource) %>%
  filter(samesource == FALSE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(x = "", y = "", fill = "Correlation",
       title = "Non-Match Comparisons")

# Create a title for the panel of plots
cor_title <- ggdraw() +
  draw_label("Correlation of Feature Variables in the Training Data", 
             fontface = "bold", 
             size = 16,
             x = 0.5,
             hjust = 0.5)

# Create a joined legend for the panel of plots
cor_legend <- get_legend(cor_match + theme(legend.position = "right"))

# Create the panel of plots
plot_grid(cor_title, 
          plot_grid(cor_match, cor_nonmatch, cor_legend, ncol = 3, rel_widths = c(2, 2, 0.5)),
          ncol = 1,
          rel_heights = c(0.25, 3))

The plots below show the correlations for the testing data. The patterns in the plots look really similar to ones of the training data.

# Create a correlation heatmap of the match comparisons in the training data
cor_match_test <- hamby224_test %>%
  select(rf_features, samesource) %>%
  filter(samesource == TRUE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + 
  labs(x = "", y = "", fill = "Correlation",
       title = "Match Comparisons")

# Create a correlation heatmap of the non-match comparisons in the training data
cor_nonmatch_test <- hamby224_test %>%
  select(rf_features, samesource) %>%
  filter(samesource == FALSE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(x = "", y = "", fill = "Correlation",
       title = "Non-Match Comparisons")

# Create a title for the panel of plots
cor_title_test <- ggdraw() +
  draw_label("Correlation of Feature Variables in the Testing Data", 
             fontface = "bold", 
             size = 16,
             x = 0.5,
             hjust = 0.5)

# Create a joined legend for the panel of plots
cor_legend_test <- get_legend(cor_match_test + theme(legend.position = "right"))

# Create the panel of plots
plot_grid(cor_title_test, 
          plot_grid(cor_match_test, cor_nonmatch_test, cor_legend_test, 
                    ncol = 3, rel_widths = c(2, 2, 0.5)),
          ncol = 1,
          rel_heights = c(0.25, 3))

Random Forest Scores

Facet Version

The plot below is the model for the one that will be used in the app for exploring the lime explanations from the bullet matching data. This is the data from set 1 of the testing dataset.

# Heatmap of rfscore for each comparison in set 1
hamby224_test %>%
  filter(set == "Set 1") %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2,
             text = paste('Bullets Compared: ', bullet1, "-", land1, 
                          "vs", bullet2, "-", land2,
                          '\nRandom Forest Score: ', 
                          ifelse(is.na(rfscore), "Missing due to tank rash", rfscore)))) +
  geom_tile(aes(fill = rfscore)) +
  facet_grid(bullet2 ~ bullet1, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "darkgrey", high = "darkorange", midpoint = 0.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  labs(x = "", y = "", fill = "RF Score")

Individual Plots Version

I also wrote some code to create the above plot using individual plots instead of facets. This is included below. (It is a bit outdated compared to my newer version of the plot.)

# Code for creating the heatmap plots using individual plots
p1 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 1", bullet2 == "Known 1") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p2 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 1", bullet2 == "Known 2") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p3 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 1", bullet2 == "Questioned") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p4 <- ggplot() + geom_blank() + theme_classic()

p5 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 2", bullet2 == "Known 2") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p6 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 2", bullet2 == "Questioned") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p7 <- ggplot() + geom_blank() + theme_classic()

p8 <- ggplot() + geom_blank() + theme_classic()

p9 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Questioned", bullet2 == "Questioned") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5, limits = c(0,1)) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score")

style(subplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, 
              nrows = 3, 
              titleX = TRUE, 
              titleY = TRUE, 
              margin = 0.03),
      hoverinfo = "skip",
      traces = 7)

Session Info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2    plotly_4.8.0.9000 cowplot_0.9.3    
##  [4] forcats_0.3.0     stringr_1.3.1     dplyr_0.7.8      
##  [7] purrr_0.2.5       readr_1.1.1       tidyr_0.8.2      
## [10] tibble_1.4.2      ggplot2_3.1.0     tidyverse_1.2.1  
## 
## loaded via a namespace (and not attached):
##  [1] rgl_0.99.16             Rcpp_1.0.0             
##  [3] lubridate_1.7.4         lattice_0.20-38        
##  [5] zoo_1.8-3               assertthat_0.2.0       
##  [7] rprojroot_1.3-2         digest_0.6.18          
##  [9] mime_0.6                R6_2.3.0               
## [11] cellranger_1.1.0        plyr_1.8.4             
## [13] backports_1.1.2         evaluate_0.11          
## [15] httr_1.4.0              pillar_1.3.0           
## [17] rlang_0.3.1             lazyeval_0.2.1         
## [19] curl_3.2                readxl_1.1.0           
## [21] miniUI_0.1.1.1          rstudioapi_0.7         
## [23] data.table_1.11.8       TTR_0.23-3             
## [25] rmarkdown_1.10          labeling_0.3           
## [27] webshot_0.5.0           x3ptools_0.0.1         
## [29] htmlwidgets_1.3         munsell_0.5.0          
## [31] shiny_1.2.0             broom_0.5.0            
## [33] compiler_3.5.2          httpuv_1.4.5.1         
## [35] modelr_0.1.2            pkgconfig_2.0.2        
## [37] htmltools_0.3.6         tidyselect_0.2.5       
## [39] viridisLite_0.3.0       later_0.7.5            
## [41] crayon_1.3.4            withr_2.1.2            
## [43] grid_3.5.2              xtable_1.8-3           
## [45] nlme_3.1-137            jsonlite_1.6           
## [47] gtable_0.2.0            magrittr_1.5           
## [49] scales_1.0.0            cli_1.0.0              
## [51] stringi_1.2.4           reshape2_1.4.3         
## [53] promises_1.0.1          robustbase_0.93-2      
## [55] xml2_1.2.0              bulletr_0.1.0.9003     
## [57] smoother_1.1            xts_0.11-0             
## [59] tools_3.5.2             manipulateWidget_0.10.0
## [61] glue_1.3.0              DEoptimR_1.0-8         
## [63] hms_0.4.2               crosstalk_1.0.0        
## [65] yaml_2.2.0              colorspace_1.3-2       
## [67] rvest_0.3.2             knitr_1.20             
## [69] bindr_0.1.1             haven_1.1.2